[Feature] Temporal Pipeline Parallelism & Stream Batch for Real-Time Video#3099
mnasser02 wants to merge 56 commits into
Ready for full review when WIP status is removed. Preliminary scan available on request. Note: Test plan and test results sections are currently empty. Please provide:
Implement Lingbot World Transformer into vllm-omni
Implement KV cache abstraction for Lingbot World Fast
Add script for offline generation using Lingbot World Fast
Implement online serving for Lingbot World and camera-based world models in general
Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>
Implement SupportsStepExecution protocol on Wan22Pipeline, decomposing the monolithic forward() into prepare_encode, denoise_step, step_scheduler, and post_decode. Add denoise_micro_step for temporal PP.
Signed-off-by: Mahdi Nasser <94046147+mnasser02@users.noreply.github.com>
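The decomposition described in this commit can be pictured with a minimal sketch. Only the four method names come from the commit message; the signatures, the `StepState` container, `ToyPipeline`, and the `run` driver are hypothetical and are not the vllm-omni API.

```python
from dataclasses import dataclass
from typing import List, Protocol, runtime_checkable


@dataclass
class StepState:
    """Hypothetical per-request state carried between steps."""
    latents: List[float]
    step: int = 0
    num_steps: int = 4


@runtime_checkable
class SupportsStepExecution(Protocol):
    # Method names are from the commit message; signatures are assumed.
    def prepare_encode(self, prompt: str, num_steps: int) -> StepState: ...
    def denoise_step(self, state: StepState) -> List[float]: ...
    def step_scheduler(self, state: StepState, noise_pred: List[float]) -> StepState: ...
    def post_decode(self, state: StepState) -> List[float]: ...


class ToyPipeline:
    """Stand-in pipeline: 'denoising' halves the latents on every step."""

    def prepare_encode(self, prompt: str, num_steps: int) -> StepState:
        return StepState(latents=[float(len(prompt))] * 2, num_steps=num_steps)

    def denoise_step(self, state: StepState) -> List[float]:
        # Pretend model forward pass: predict half of the current latents as noise.
        return [0.5 * x for x in state.latents]

    def step_scheduler(self, state: StepState, noise_pred: List[float]) -> StepState:
        # Pretend scheduler update: subtract the prediction and advance the counter.
        state.latents = [x - n for x, n in zip(state.latents, noise_pred)]
        state.step += 1
        return state

    def post_decode(self, state: StepState) -> List[float]:
        return state.latents


def run(pipe: SupportsStepExecution, prompt: str, num_steps: int = 4) -> List[float]:
    """Drive the four phases externally instead of calling one monolithic forward()."""
    state = pipe.prepare_encode(prompt, num_steps)
    while state.step < state.num_steps:
        noise_pred = pipe.denoise_step(state)
        state = pipe.step_scheduler(state, noise_pred)
    return pipe.post_decode(state)
```

Because the loop lives outside the pipeline, a scheduler can interleave steps from different requests or chunks, which is presumably what makes a micro-step variant for temporal PP possible.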
Different ranks work on different chunks. A context manager that presents a rank's request state as a chunk state lets the existing functionality be reused unchanged.
Signed-off-by: Mahdi Nasser <94046147+mnasser02@users.noreply.github.com>
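One way to read this commit is the following sketch of a context manager that temporarily swaps a chunk's fields into the request state. All names here (`RequestState`, `ChunkState`, `as_chunk`, `advance_one_step`) are hypothetical and chosen for illustration; the PR's actual classes may look quite different.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class ChunkState:
    chunk_id: int
    latents: List[float]
    step: int = 0


@dataclass
class RequestState:
    request_id: str
    latents: List[float] = field(default_factory=list)
    step: int = 0


@contextmanager
def as_chunk(req: RequestState, chunk: ChunkState) -> Iterator[RequestState]:
    """Make `req` look like `chunk` inside the block; write results back on exit."""
    saved = (req.latents, req.step)
    req.latents, req.step = chunk.latents, chunk.step
    try:
        yield req
    finally:
        # Persist whatever the request-level code did into the chunk,
        # then restore the request's own fields.
        chunk.latents, chunk.step = req.latents, req.step
        req.latents, req.step = saved


def advance_one_step(req: RequestState) -> None:
    """Existing request-level code that knows nothing about chunks."""
    req.latents = [0.5 * x for x in req.latents]
    req.step += 1
```

Used as `with as_chunk(req, chunk) as view: advance_one_step(view)`, the request-level helper updates the chunk without needing any chunk-specific code path.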
…ed task Signed-off-by: Mahdi Nasser <94046147+mnasser02@users.noreply.github.com>
… pipeline (B and T hardcoded for now) Signed-off-by: Mahdi Nasser <94046147+mnasser02@users.noreply.github.com>
…instead of sync send/recv Signed-off-by: Mahdi Nasser <94046147+mnasser02@users.noreply.github.com>
Plain P2P on a size-2 process group triggers lazy sub-communicator creation, which requires the peer to be present.
Signed-off-by: Mahdi Nasser <94046147+mnasser02@users.noreply.github.com>
Signed-off-by: Miguel Vieira Pereira <52176659+Miguel0312@users.noreply.github.com> Signed-off-by: Miguel Vieira Pereira <miguel.vpereira14@gmail.com>
Hi, may I ask if there is an estimated plan for merging this PR? I have some features that need to be built on top of it.
Currently, the core functionality of StreamDiffusionV2 (temporal PP through the micro-step execution path and StreamBatchScheduler) should be almost ready. However, we can't test it properly because causal Wan isn't available in vllm-omni (in the paper, SDV2 builds on CausVid applied to Wan). We're not sure whether this is being worked on or whether it can be left for a future PR. To check functionality, we ran some tests on a not-so-correct V2V adaptation of Wan2.1-T2V-1.3B. I think we can open the PR before the end of this week (after cleaning up the code a bit) and share our preliminary results. Update: we decided to test on Lingbot #3701
…ream-diffusion Signed-off-by: Mahdi Nasser <94046147+mnasser02@users.noreply.github.com>
resolve conflicts please |
…diffusion Signed-off-by: Mahdi Nasser <94046147+mnasser02@users.noreply.github.com>
Purpose
Related issue: #2280.
Tested on Lingbot #3701 (not merged yet).
Implements:
- `SupportsMicroStepExecution` on `LingbotWorldFastPipeline` with `StreamVAE`
- `StreamBatchScheduler` driving the micro-step execution pipeline, with SLO-adaptive batch size

Not in scope:
Test Plan
Test Result
Unit tests: 31 passed, 17 warnings
E2E on A100s:
Latency
Each cell: `e2e_s / ttff_s` (seconds). A chunk is 12 pixel frames.
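For readers reproducing these numbers, here is a small sketch of how the two values in a cell could be derived from timestamps. It assumes `ttff_s` means time to first frame, measured when the first chunk finishes, and `e2e_s` means time until the last chunk finishes; the function name and the derived `fps` field are illustrative, not taken from the PR's benchmark script.

```python
from typing import Dict, List

FRAMES_PER_CHUNK = 12  # from the PR description: a chunk is 12 pixel frames


def latency_metrics(request_start: float, chunk_done_times: List[float]) -> Dict[str, float]:
    """Return e2e_s, ttff_s, and average frames per second for one request.

    `chunk_done_times` holds the wall-clock time at which each chunk became
    available, in order. Both readings of the metric names are assumptions.
    """
    if not chunk_done_times:
        raise ValueError("need at least one completed chunk")
    ttff_s = chunk_done_times[0] - request_start
    e2e_s = chunk_done_times[-1] - request_start
    fps = FRAMES_PER_CHUNK * len(chunk_done_times) / e2e_s
    return {"e2e_s": e2e_s, "ttff_s": ttff_s, "fps": fps}
```

With three chunks finishing 2, 3, and 4 seconds after the request starts, this gives `ttff_s = 2.0`, `e2e_s = 4.0`, and 36 frames over 4 seconds.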
Per-rank occupancy (PP=4, num_chunks=20)
Throughput vs GPU count
Micro-step latency distribution (PP=4, num_chunks=10)
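Micro-step latencies like these are the natural input to the SLO-adaptive batch size listed under Implements. The sketch below shows one possible control rule (add one while under budget, halve when over). It is an assumption for illustration only: the class name, the thresholds, and the rule itself are hypothetical and do not describe what `StreamBatchScheduler` actually does.

```python
from collections import deque
from typing import Deque


class AdaptiveBatchSize:
    """Hypothetical controller: grow the batch while under the SLO, shrink when over."""

    def __init__(self, slo_s: float, min_bs: int = 1, max_bs: int = 8, window: int = 4):
        self.slo_s = slo_s
        self.min_bs = min_bs
        self.max_bs = max_bs
        self.batch_size = min_bs
        self._recent: Deque[float] = deque(maxlen=window)

    def observe(self, micro_step_latency_s: float) -> int:
        """Record one micro-step latency and return the batch size to use next."""
        self._recent.append(micro_step_latency_s)
        avg = sum(self._recent) / len(self._recent)
        if avg > self.slo_s:
            # Over budget: halve the batch and forget stale samples.
            self.batch_size = max(self.min_bs, self.batch_size // 2)
            self._recent.clear()
        elif avg < 0.8 * self.slo_s:
            # Comfortable headroom: grow by one.
            self.batch_size = min(self.max_bs, self.batch_size + 1)
        return self.batch_size
```

Averaging over a short window keeps a single slow micro-step from collapsing the batch size, while the band between 80% and 100% of the SLO leaves the size unchanged to avoid oscillation.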
PP=4, num_chunks=20
pp4_c20.mp4
The output differs noticeably from the original because chunks can attend to previous chunks that have not yet reached their final denoising step.
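A toy model of a staggered schedule makes this concrete. It assumes each rank owns one denoising stage and one new chunk enters the pipeline per tick; the PR's real schedule may differ, and both function names are made up for this sketch.

```python
from typing import List, Optional


def staggered_schedule(num_chunks: int, num_stages: int) -> List[List[Optional[int]]]:
    """schedule[tick][rank] is the chunk that rank works on at that tick, or None if idle."""
    ticks = num_chunks + num_stages - 1
    return [
        [t - r if 0 <= t - r < num_chunks else None for r in range(num_stages)]
        for t in range(ticks)
    ]


def stages_done_by_previous(chunk: int, stage: int) -> Optional[int]:
    """Number of stages chunk-1 has completed when `chunk` starts `stage`.

    Under the staggered schedule, chunk-1 is always exactly one stage ahead,
    so it has finished stages 0..stage. Returns None for the first chunk.
    """
    if chunk == 0:
        return None
    return stage + 1
```

In this model, with 4 stages a chunk starting stage 0 sees a predecessor that has completed only 1 of 4 stages, and only at its own last stage does it see a fully denoised predecessor. A sequential run, by contrast, would finish each chunk before starting the next.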